ongleyi
RStudio has an input error when I type Chinese, so I am writing this report in English.
Update 20180613
Modified the plots to ensure readability, including axis ranges, axis breaks, axis labels and figure size.
Modified the typesetting of Rmd.
Added the content of Reflection.
Modified the code so that every line is shorter than 80 characters.
Update: The typing error still exists after I changed the Preferences. When I type a word with my Chinese input method, it displays the letters with an apostrophe (’) after the first press of the space key, and then displays the Chinese characters after any key is pressed a second time.
For example:
ni’hao你好 (I typed nihao only once)
========================================================
I plot the distribution of every single variable, faceted by quality.
We can see that the data set contains far fewer high-quality and low-quality wines than medium-quality ones. The quality scores are roughly normally distributed.
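As a rough sketch of what faceting by quality computes, the counts behind each facet can be tabulated directly. The data frame below is a made-up stand-in for the wine data (its values are assumptions, not taken from the data set), and the bin edges are arbitrary.

```r
# Toy stand-in for the wine data; these values are made up for illustration.
wine <- data.frame(
  quality          = c(3, 5, 5, 5, 6, 6, 6, 7, 7, 8),
  volatile.acidity = c(0.98, 0.62, 0.58, 0.66, 0.51, 0.49, 0.55,
                       0.38, 0.35, 0.39)
)

# Number of wines per quality grade.
quality.counts <- table(wine$quality)
print(quality.counts)

# Bin one variable, then cross-tabulate the bins against quality.
# Each column holds the histogram counts of one facet.
bins <- cut(wine$volatile.acidity, breaks = seq(0.2, 1.0, by = 0.2))
facet.counts <- table(bin = bins, quality = wine$quality)
print(facet.counts)
```

Reading down a column of `facet.counts` gives the histogram for one quality grade, which makes it easy to compare where each grade peaks.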
Fixed acidity: most acids in wine are fixed, or nonvolatile (they do not evaporate readily), and they form the skeleton of the wine. Fixed acidity is concentrated in [6, 10], with a peak around 7. Its distribution is roughly normal, and the facets show little difference between quality levels.
Volatile acidity is the amount of acetic acid in wine, which at too high a level can lead to an unpleasant vinegar taste. Volatile acidity is concentrated in [0.1, 1.0], with a peak around 0.5. High-quality wine peaks at a lower value: when quality equals 5 or 6, the count peaks above 0.4, but when quality equals 7, the peak is below 0.4. This may be a clear single factor related to wine quality, so it needs more exploration.
Because volatile acidity clearly corresponds to wine quality, I wanted to explore it further. Searching the Internet, I found that balance of taste is an important factor in judging wine quality. I wondered whether the balance between acid smell and acid taste influences quality, so I created a new factor, the volatile acidity ratio, to measure the acid balance, and plotted it. Disappointingly, the new factor didn’t work out: the distribution of the volatile acidity ratio is similar to that of volatile acidity, and if anything separates the quality levels less well. The reason may be that volatile acidity is an unpleasant factor but fixed acidity is not. People also refresh the wine (let it breathe) before drinking, at which point the volatile acidity evaporates, so the balance between acid smell and acid taste cannot be modelled with just these two factors.
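The text does not give the formula for the volatile acidity ratio, so the sketch below is only one plausible reading. It assumes total.acidity = fixed.acidity + citric.acid (which agrees with the means in the summary further down, 8.32 + 0.271 = 8.591) and volatile.acidity.ratio = volatile.acidity / total.acidity. The rows are made-up illustration values.

```r
# Made-up rows for illustration; not taken from the data set.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 11.2, 7.9),
  volatile.acidity = c(0.70, 0.88, 0.28, 0.60),
  citric.acid      = c(0.00, 0.00, 0.56, 0.06)
)

# ASSUMPTION: total acidity is the sum of fixed acidity and citric acid.
wine$total.acidity <- wine$fixed.acidity + wine$citric.acid

# ASSUMPTION: the ratio compares the volatile ("smell") acid
# with the total ("taste") acid.
wine$volatile.acidity.ratio <- wine$volatile.acidity / wine$total.acidity

print(wine)
```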
Citric acid is found in small quantities; it can add ‘freshness’ and flavor to wines. Higher-quality wines seem to have more citric acid.
Residual sugar is the amount of sugar remaining after fermentation stops. It is rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet. The distribution shows that most of the wines in the data set are dry or semi-dry, and a very small share of the sample is semi-sweet.
Chlorides is the amount of salt in the wine; salt contains chlorine. The distribution of chlorides didn’t show much difference across the facets, and a log transform didn’t help either. Chlorides themselves are odorless, but under microbial action they can be converted to trichloroanisole (TCA). With too much TCA, the wine smells unpleasantly of moldy old newspaper and wet cardboard. So I want to explore this factor further next.
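For reference, a log transform of chlorides can be sketched as follows. The five values are the minimum, quartiles and maximum reported in the summary further down; `tail.asymmetry` is a hypothetical helper introduced here only to show what the transform changes (the long right tail), not whether it separates the quality levels.

```r
# Five-number summary of chlorides as reported in the summary table.
chlorides <- c(0.012, 0.070, 0.079, 0.090, 0.611)

# Hypothetical helper: how much longer the right tail is than the left.
# A value near 1 means the two tails are about equally long.
tail.asymmetry <- function(x) {
  (max(x) - median(x)) / (median(x) - min(x))
}

raw.asym <- tail.asymmetry(chlorides)
log.asym <- tail.asymmetry(log10(chlorides))

cat("raw scale:", round(raw.asym, 2), "\n")
cat("log scale:", round(log.asym, 2), "\n")
```

The transform pulls in the right tail, but that alone does not make the facets differ by quality.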
Free sulfur dioxide is the free form of SO2, which exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. In low concentrations SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm it becomes evident in the nose and taste of the wine. Too much SO2 makes the wine smell like rotten eggs. The distribution didn’t show a direct impact on quality, so I want to explore it in combination with other factors next.
Total sulfur dioxide is the amount of free and bound forms of SO2. Most of the wines in the data set have total sulfur dioxide below 0.1 g/dm^3.
The density of wine is close to that of water, depending on the percent alcohol and sugar content. Density itself didn’t show a strong correlation with quality, but people always say good wine has a ‘mellow’ taste, so I will explore it in combination with alcohol content next.
pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale. It is related to acidity, and by itself didn’t show a strong correlation with quality. So I tend to drop this factor and explore acidity instead.
Sulphates are a wine additive that can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. The distribution doesn’t show much difference between wines of different quality. Because of its relation to the SO2 level and antimicrobial action, it is worth exploring.
Alcohol is the percent alcohol content of the wine. It affects not only density but also the growth of microbes. Good-quality wines tend to have a higher alcohol content.
##  fixed.acidity     volatile.acidity  citric.acid       residual.sugar
##  Min.   : 4.60     Min.   :0.1200    Min.   :0.000     Min.   : 0.900
##  1st Qu.: 7.10     1st Qu.:0.3900    1st Qu.:0.090     1st Qu.: 1.900
##  Median : 7.90     Median :0.5200    Median :0.260     Median : 2.200
##  Mean   : 8.32     Mean   :0.5278    Mean   :0.271     Mean   : 2.539
##  3rd Qu.: 9.20     3rd Qu.:0.6400    3rd Qu.:0.420     3rd Qu.: 2.600
##  Max.   :15.90     Max.   :1.5800    Max.   :1.000     Max.   :15.500
##  chlorides             free.sulfur.dioxide   total.sulfur.dioxide
##  Min.   :0.01200       Min.   :0.00100       Min.   :0.00600
##  1st Qu.:0.07000       1st Qu.:0.00700       1st Qu.:0.02200
##  Median :0.07900       Median :0.01400       Median :0.03800
##  Mean   :0.08747       Mean   :0.01587       Mean   :0.04647
##  3rd Qu.:0.09000       3rd Qu.:0.02100       3rd Qu.:0.06200
##  Max.   :0.61100       Max.   :0.07200       Max.   :0.28900
##  density           pH                sulphates         alcohol
##  Min.   : 990.1    Min.   :2.740     Min.   :0.3300    Min.   : 8.40
##  1st Qu.: 995.6    1st Qu.:3.210     1st Qu.:0.5500    1st Qu.: 9.50
##  Median : 996.8    Median :3.310     Median :0.6200    Median :10.20
##  Mean   : 996.7    Mean   :3.311     Mean   :0.6581    Mean   :10.42
##  3rd Qu.: 997.8    3rd Qu.:3.400     3rd Qu.:0.7300    3rd Qu.:11.10
##  Max.   :1003.7    Max.   :4.010     Max.   :2.0000    Max.   :14.90
##  quality           total.acidity     volatile.acidity.ratio
##  Min.   :3.000     Min.   : 4.750    Min.   :0.01283
##  1st Qu.:5.000     1st Qu.: 7.240    1st Qu.:0.04208
##  Median :6.000     Median : 8.170    Median :0.06372
##  Mean   :5.636     Mean   : 8.591    Mean   :0.06543
##  3rd Qu.:6.000     3rd Qu.: 9.560    3rd Qu.:0.08465
##  Max.   :8.000     Max.   :16.550    Max.   :0.20789
## [1] 1599 14
I want to explore the data from the following aspects:
- Influence of SO2: using free.sulfur.dioxide, total.sulfur.dioxide and sulphates.
- Relationship between acid and sugar: using fixed.acidity, citric.acid and residual.sugar.
- Relationship between chlorides and the antimicrobial environment: using chlorides, residual.sugar, alcohol and sulphates (the last 3 factors are relevant to microbial growth).
I tried to find the relationship between quality and each single factor. This is a little difficult, because most of the factors are not strongly correlated with quality. Here I display the top 3 factors that I think are most useful and reasonable.
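One way to screen single factors, not necessarily how the three factors here were chosen, is to rank them by their rank correlation with quality. The sketch below uses a made-up data frame and Spearman's correlation (an assumption on my part, chosen because quality is an ordinal score).

```r
# Made-up rows for illustration; not taken from the data set.
wine <- data.frame(
  quality          = c(3, 5, 5, 6, 6, 7, 7, 8),
  volatile.acidity = c(0.98, 0.62, 0.66, 0.51, 0.55, 0.38, 0.42, 0.40),
  citric.acid      = c(0.02, 0.10, 0.22, 0.26, 0.24, 0.40, 0.36, 0.45),
  alcohol          = c(9.0, 9.5, 9.4, 10.2, 10.6, 11.5, 11.1, 12.8)
)

# Spearman correlation of every factor with quality.
factors <- setdiff(names(wine), "quality")
rho <- sapply(wine[factors], function(x) {
  cor(x, wine$quality, method = "spearman")
})

# Strongest relationships first, regardless of sign.
ranked <- rho[order(abs(rho), decreasing = TRUE)]
print(round(ranked, 2))
```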
We can see that volatile acidity is negatively related to quality, but the best wines don’t have the lowest volatile acidity. I guess there are 2 reasons:
- The sample of best-quality wines is so small that it causes the deviation.
- Volatile acidity is not the only criterion for judging a wine, so a wine that is excellent in other respects but has an acceptably high level of volatile acidity can still be considered high quality.
High-quality wine has a higher level of citric acid. Citric acid can add freshness and flavor to wine, which makes the taste richer.
High-quality wine has a higher level of alcohol. This may have the following reasons:
1. A high level of alcohol can protect the wine from microorganisms.
2. Wine with a high level of alcohol is more likely to be a good old wine.
I explore the data from the following 4 aspects:
- Influence of SO2: using free.sulfur.dioxide, total.sulfur.dioxide and sulphates.
- Relationship between acid and sugar: using fixed.acidity, citric.acid and residual.sugar.
- Relationship between chlorides and the antimicrobial environment: using chlorides, residual.sugar, alcohol and sulphates (the last 3 factors are relevant to microbial growth).
- Other organic matter: using density and alcohol to explore the other organic matter in the wine.
We know that too much SO2 smells like rotten eggs, but too little SO2 allows microbial contamination. So to some degree SO2 is indispensable. But what is the difference between good and bad wine in this respect? To explore this, I create 2 new factors:
SO2 ratio
so2.ratio = free.sulfur.dioxide / Total Volatile Organic Compound = free.sulfur.dioxide / (free.sulfur.dioxide + volatile.acidity)
The SO2 ratio measures how noticeable the SO2 is.
SO2 control
The sulphates control the sulfite ions in the wine, which can convert to SO2. This factor can be considered the amount of SO2 released when refreshing the wine before tasting.
so2.control = (total.sulfur.dioxide - free.sulfur.dioxide) / sulphates
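The two SO2 factors can be computed as in this minimal sketch. The column names and formulas follow the text; the three rows are made-up values in g/dm^3, not rows from the data set.

```r
# Made-up rows for illustration (all concentrations in g/dm^3).
wine <- data.frame(
  free.sulfur.dioxide  = c(0.011, 0.025, 0.015),
  total.sulfur.dioxide = c(0.034, 0.067, 0.054),
  volatile.acidity     = c(0.70, 0.88, 0.76),
  sulphates            = c(0.56, 0.68, 0.65)
)

# Share of free SO2 among the volatile components considered here.
wine$so2.ratio <- with(
  wine, free.sulfur.dioxide / (free.sulfur.dioxide + volatile.acidity)
)

# Bound SO2 (total minus free) relative to the sulphates level.
wine$so2.control <- with(
  wine, (total.sulfur.dioxide - free.sulfur.dioxide) / sulphates
)

print(wine[, c("so2.ratio", "so2.control")])
```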
The plot shows that good wine has a higher SO2 ratio, and that the slope between SO2 ratio and SO2 control is much larger. The slope represents the release rate of SO2 when refreshing. From the plot, we can tell:
- Good wine has a higher SO2 ratio, which protects the wine better from microbial contamination.
- Good wine has a higher SO2 release rate, so SO2 reaches a high level more quickly and also escapes easily when refreshing.
In short, a lot of SO2 before refreshing may be good, because it shows that the antimicrobial process is reliable. But after refreshing, too much SO2 can cause an unpleasant sensation, so this kind of wine tends to get lower grades.
The ratio between acid and sugar may affect quality, so I created a new factor and plotted it.
acid.sugar.ratio = total.acidity / residual.sugar
Disappointingly, it doesn’t work as I expected. It occurred to me that red wines are classified by residual sugar, so I bucketed the data by residual sugar and drew boxplots of acidity against quality.
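A minimal sketch of the ratio and the bucketing step follows. The bucket edges are not stated in the text; the cut points of 4, 12 and 45 g/dm^3 (dry, semi-dry, semi-sweet) are an assumption based on the usual sweetness classes, and the rows are made-up values.

```r
# Made-up rows for illustration; not taken from the data set.
wine <- data.frame(
  total.acidity  = c(7.4, 7.8, 11.76, 8.1, 9.3, 10.2),
  residual.sugar = c(1.9, 2.6, 1.9, 6.1, 13.9, 15.5),
  quality        = c(5, 5, 6, 6, 7, 5)
)

wine$acid.sugar.ratio <- wine$total.acidity / wine$residual.sugar

# ASSUMPTION: bucket edges in g/dm^3; the report does not list them.
wine$sugar.bucket <- cut(
  wine$residual.sugar,
  breaks = c(0, 4, 12, 45),
  labels = c("dry", "semi-dry", "semi-sweet")
)

# Median acidity per bucket and quality: the center lines that a
# boxplot of acidity by quality within each bucket would draw.
medians <- aggregate(
  total.acidity ~ sugar.bucket + quality, data = wine, FUN = median
)
print(medians)
```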
Chlorides themselves are odorless, but under microbial action they can be converted to trichloroanisole (TCA). With too much TCA, the wine smells unpleasantly of moldy old newspaper and wet cardboard. Here I create a factor that measures the microbial growth environment. Sugar is considered to favor microbial growth, while alcohol and sulphates inhibit it. The formula is as follows:
micro.growing = residual.sugar / (alcohol * sulphates)
Next I create the factor TCA.potential, which is the product of micro.growing and chlorides:
TCA.potential = micro.growing * chlorides
Then I plot the relationship between TCA.potential and quality. The resulting boxplot shows that TCA.potential is negatively related to quality.
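The two formulas above translate directly into code. This is a sketch on made-up rows, so the medians it prints illustrate the computation only and say nothing about the real data.

```r
# Made-up rows for illustration; not taken from the data set.
wine <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.8),
  alcohol        = c(9.4, 9.8, 12.8, 11.0),
  sulphates      = c(0.56, 0.68, 0.82, 0.75),
  chlorides      = c(0.076, 0.098, 0.065, 0.070),
  quality        = c(5, 5, 7, 7)
)

# Sugar favors microbial growth; alcohol and sulphates inhibit it.
wine$micro.growing <- wine$residual.sugar /
  (wine$alcohol * wine$sulphates)

# Growth environment weighted by the amount of chlorides available.
wine$TCA.potential <- wine$micro.growing * wine$chlorides

# Median TCA.potential per quality grade (the boxplot center lines).
tca.by.quality <- tapply(wine$TCA.potential, wine$quality, median)
print(tca.by.quality)
```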
We know that the density of red wine is less than that of water because it contains alcohol and other organic matter. So I want to find the relationship between this organic matter and quality. First, I created a new factor named other.matter.density to measure the weight of the wine with the alcohol and water removed.
other.matter.density = density - 1000 * (alcohol/100*0.79 - (1 - alcohol/100*1))
The unit is g/dm^3.
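The sketch below transcribes the formula exactly as written. For comparison it also shows a hypothetical variant, not from the report, that adds the alcohol and water terms so that their combined mass is what gets subtracted; this is what "the weight of the wine with the alcohol and water removed" would suggest under a strict mass balance. Input values are made up, with density in g/dm^3 and alcohol in percent by volume.

```r
# Formula exactly as written in the text.
other.matter.as.written <- function(density, alcohol) {
  density - 1000 * (alcohol / 100 * 0.79 - (1 - alcohol / 100 * 1))
}

# HYPOTHETICAL variant (not from the report): subtract the mass of
# alcohol (0.79 g/cm^3) plus the mass of water (1 g/cm^3).
other.matter.mass.balance <- function(density, alcohol) {
  density - 1000 * (alcohol / 100 * 0.79 + (1 - alcohol / 100 * 1))
}

# Made-up inputs: density in g/dm^3, alcohol in % by volume.
density <- c(997.8, 996.8, 995.6)
alcohol <- c(9.4, 10.2, 11.1)

as.written   <- other.matter.as.written(density, alcohol)
mass.balance <- other.matter.mass.balance(density, alcohol)
print(data.frame(density, alcohol, as.written, mass.balance))
```

The two versions differ by exactly twice the water term, so it is worth checking which one the plotted factor actually used before interpreting its relationship with quality.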
The plot shows that other.matter is negatively related to quality. The reason may be that most of the desirable organic matter in wine (such as acids and tannin) has a lower density than water, so a wine with more of this organic matter tends to have a mellow taste.
free.sulfur.dioxide and total.sulfur.dioxide increase together.
The factors that promote or inhibit wine quality have been explained in detail above.
EDA needs not only data-processing techniques but also deep insight into the dataset. Only in this way can we get interesting results.
Construct more factors: I am not very satisfied with the ‘other matter’ factor, and this variable can be developed further, perhaps by subtracting the residual sugar or the chlorides. After learning more about red wine, we can find more factors that might be useful.
Reconstruct the factors: Some factors are highly correlated, which is bad for modelling, so we can use PCA or some other method to refine the factors.
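As a sketch of the PCA idea, base R's `prcomp` can replace a set of correlated factors with uncorrelated components. The rows below are made up, and the choice of these four columns is only an example of factors that tend to move together.

```r
# Made-up rows for illustration; not taken from the data set.
wine <- data.frame(
  free.sulfur.dioxide  = c(0.011, 0.025, 0.015, 0.017, 0.013, 0.009),
  total.sulfur.dioxide = c(0.034, 0.067, 0.054, 0.060, 0.040, 0.018),
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.9),
  citric.acid          = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.06)
)

# Center and scale first, because the factors have different ranges.
pca <- prcomp(wine, center = TRUE, scale. = TRUE)

# Share of the total variance captured by each component.
explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(explained, 3))

# Scores on the first two components, usable as refined factors.
scores <- pca$x[, 1:2]
print(round(scores, 2))
```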
Modelling: In my opinion, the quality of a wine depends not on its advantages but on its flaws, since balance of taste is the key criterion for judging a wine. So maybe we can use the negative factors to construct a model to predict a wine’s quality. In this way, we can learn how the factors influence quality in a more mathematical way.